NLP Course - Statistical Language Models Notebook

Introduction to Language Models

Language models (LMs) are statistical tools that estimate the probability of a sequence of words or predict the next word in a sequence. They are foundational to natural language processing tasks like text generation, speech recognition, and machine translation.

Key Concepts

Statistical Language Models

Statistical language models rely on frequency counts from a text corpus to estimate word sequence probabilities. They are based on observed patterns in data; the n-gram model is the most common example.

Bayes Decomposition

Bayes decomposition, or the chain rule of probability, breaks down the joint probability of a word sequence into a product of conditional probabilities.

Mathematical Formulation

P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁, w₂) · ... · P(wₙ | w₁, ..., wₙ₋₁)

This expresses the probability of a sequence as the product of the probability of each word given all previous words. The decomposition is exact; approximations enter only when the history is truncated, as in the Markov assumption below.

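To make the decomposition concrete, the minimal sketch below multiplies the conditional probabilities for the three-word sequence "the cat sleeps"; the probability values are invented purely for illustration.

# Example: chain rule decomposition (hypothetical probabilities)
cond_probs = [
    0.2,   # P(the)
    0.05,  # P(cat | the)
    0.1,   # P(sleeps | the, cat)
]

sequence_prob = 1.0
for p in cond_probs:
    sequence_prob *= p  # multiply in each conditional probability

print(sequence_prob)  # 0.2 * 0.05 * 0.1 = 0.001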

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) estimates the parameters of a language model by maximizing the likelihood of the observed data, typically using frequency counts.

Mathematical Formulation

For an n-gram model, which conditions each word on the k = n-1 preceding words, the MLE estimate of the conditional probability is:

P(wₙ | wₙ₋ₖ, ..., wₙ₋₁) = count(wₙ₋ₖ, ..., wₙ₋₁, wₙ) / count(wₙ₋ₖ, ..., wₙ₋₁)

where count(·) denotes the number of times the given word sequence occurs in the corpus.

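As a sketch of how these counts turn into probabilities, the bigram case (k = 1) can be computed directly with a Counter; the toy corpus below is invented for illustration.

# Example: MLE bigram probability from raw counts
from collections import Counter

tokens = ["the", "cat", "sleeps", "the", "cat", "purrs", "the", "dog", "barks"]

bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(w_prev, w)
unigram_counts = Counter(tokens)                  # count(w_prev)

def mle_bigram_prob(prev_word, word):
    # count(prev_word, word) / count(prev_word); 0.0 for unseen contexts
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(mle_bigram_prob("the", "cat"))  # 2/3, since "the" occurs 3 times and is followed by "cat" twice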

Markov Assumption

The Markov assumption simplifies language modeling by assuming that the probability of a word depends only on a fixed number of previous words, rather than the entire sequence.

Mathematical Formulation

For an n-gram model with a Markov assumption of order k (where k = n-1):

P(wₙ | w₁, ..., wₙ₋₁) ≈ P(wₙ | wₙ₋ₖ, ..., wₙ₋₁)

For example, in a bigram model (k=1), the probability of a word depends only on the immediately preceding word.

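The sketch below shows the practical effect of the assumption: however long the history grows, only its last k words are kept as the conditioning context. The helper name markov_context is ours, for illustration only.

# Example: truncating the history under a Markov assumption of order k
def markov_context(history, k):
    # Keep only the last k words; an order-k model conditions on nothing more.
    return tuple(history[-k:]) if k > 0 else ()

history = ["the", "quick", "brown", "fox"]
print(markov_context(history, 1))  # ('fox',)          -> bigram model
print(markov_context(history, 2))  # ('brown', 'fox')  -> trigram model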

N-Gram Models

N-gram models are statistical language models that predict the next word from the previous n-1 words, using frequency counts from a corpus. An n-gram is a contiguous sequence of n words.

Types of N-Grams

- Unigram (n = 1): a single word, e.g., "the"
- Bigram (n = 2): two consecutive words, e.g., "the cat"
- Trigram (n = 3): three consecutive words, e.g., "the cat sleeps"

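NLTK's ngrams helper makes the three cases easy to inspect on a toy sentence (a minimal sketch; the tokens are invented):

# Example: extracting n-grams of different orders with NLTK
from nltk import ngrams

tokens = ["the", "cat", "sleeps"]
print(list(ngrams(tokens, 1)))  # unigrams: [('the',), ('cat',), ('sleeps',)]
print(list(ngrams(tokens, 2)))  # bigrams:  [('the', 'cat'), ('cat', 'sleeps')]
print(list(ngrams(tokens, 3)))  # trigrams: [('the', 'cat', 'sleeps')]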

Example: Bigram Probability

Calculating the relative frequency of the bigram "the cat" in a toy corpus.

# Example: Bigram model with NLTK
from nltk import bigrams
from nltk.probability import FreqDist, MLEProbDist

text = ["the", "cat", "sleeps", "the", "dog", "barks"]

# Count each bigram in the toy corpus, then build an MLE distribution over bigrams.
bigram_freq = FreqDist(bigrams(text))
prob_dist = MLEProbDist(bigram_freq)

# Note: this is the bigram's relative frequency among all observed bigrams,
# not the conditional probability P(cat | the).
print(f"Probability of 'the cat': {prob_dist.prob(('the', 'cat'))}")
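
To get the conditional probability P(cat | the) instead of the bigram's overall relative frequency, one option is NLTK's ConditionalFreqDist paired with ConditionalProbDist, sketched below on the same toy corpus.

# Example: conditional bigram probability with NLTK
from nltk import bigrams
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

text = ["the", "cat", "sleeps", "the", "dog", "barks"]

# Treat each bigram (w_prev, w) as a (condition, sample) pair: next-word counts per context.
cfd = ConditionalFreqDist(bigrams(text))
cpd = ConditionalProbDist(cfd, MLEProbDist)

# "the" is followed once by "cat" and once by "dog", so P(cat | the) = 0.5.
print(f"P(cat | the) = {cpd['the'].prob('cat')}")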